CV-Neural Network

Cross-entropy and its usage?

Cross-entropy loss is a commonly used loss function for classification tasks, particularly in multi-class problems. The loss function is defined as:

$$L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k) = -\log(\hat{y}_c)$$

where:

  • $y_k$ is the ground-truth label for class $k$ (one-hot encoded).
  • $\hat{y}_k$ is the predicted probability for class $k$.
  • $K$ is the total number of classes.
  • $c$ is the index of the true class; because $y$ is one-hot, the sum reduces to $-\log(\hat{y}_c)$.
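
A minimal NumPy sketch of this loss for a single example (function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)       # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))  # reduces to -log(y_pred[c]) for one-hot y_true

y_true = np.array([0.0, 1.0, 0.0])           # ground truth: class 1
y_pred = np.array([0.1, 0.7, 0.2])           # e.g. a softmax output
print(cross_entropy(y_true, y_pred))         # -log(0.7) ≈ 0.357
```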

Usage:

Multi-Class Classification: Cross-entropy is widely used in deep learning models for tasks like image classification (e.g., CNNs, transformers).

  • It is often paired with the softmax activation function in the output layer of the network.

Advantages

  • Encourages confident and correct predictions.
  • Can handle imbalanced data by incorporating class weights.
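
One simple way to incorporate class weights, continuing the sketch above (the weights here are made up for illustration):

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights, eps=1e-12):
    """Cross-entropy where each class contributes in proportion to its weight."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(class_weights * y_true * np.log(y_pred))

class_weights = np.array([1.0, 1.0, 5.0])    # hypothetical: up-weight the rare class 2
y_true = np.array([0.0, 0.0, 1.0])
y_pred = np.array([0.2, 0.3, 0.5])
print(weighted_cross_entropy(y_true, y_pred, class_weights))  # 5 * -log(0.5) ≈ 3.466
```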

How do we know if we are underfitting or overfitting?

Cross-validation: measure prediction error on held-out validation data and compare it with the training error.

Underfitting

  • add more parameters (more features, more layers, etc.)

Overfitting

  • remove parameters
  • add regularizers
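
A toy sketch of this diagnostic using polynomial regression rather than a neural network (the train/validation gap behaves the same way; the degrees chosen are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth function; alternating points act as a validation split.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)
x_tr, y_tr = x[::2], y[::2]
x_va, y_va = x[1::2], y[1::2]

for degree in (1, 3, 12):  # too simple, about right, too flexible
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree:2d}: train MSE {err_tr:.3f}, val MSE {err_va:.3f}")

# Underfitting: both errors stay high (degree 1).
# Overfitting: training error is low but validation error is much higher (degree 12).
```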

Regularization

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} L\big(f_\theta(x^{(i)}), y^{(i)}\big) + R(\theta), \qquad R(\theta) = \lambda \lVert \theta \rVert_2^2$$
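
A minimal sketch of this objective for a linear model with squared-error loss (`lam` plays the role of $\lambda$; all names are illustrative):

```python
import numpy as np

def objective(theta, X, y, lam=1e-2):
    """Data loss plus the L2 penalty R(theta) = lam * ||theta||_2^2."""
    data_loss = np.mean((X @ theta - y) ** 2)
    penalty = lam * np.sum(theta ** 2)
    return data_loss + penalty

def gradient(theta, X, y, lam=1e-2):
    """Gradient of the objective; the penalty contributes 2 * lam * theta (weight decay)."""
    return 2 * X.T @ (X @ theta - y) / len(y) + 2 * lam * theta

# Toy usage: a few steps of gradient descent on random data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 5)), rng.standard_normal(50)
theta = np.zeros(5)
for _ in range(100):
    theta -= 0.1 * gradient(theta, X, y)
print(objective(theta, X, y))
```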

What is used to add nonlinearity to a network? How do the options compare?

Nonlinearity in neural networks is introduced through activation functions, which are applied to the output of each neuron and enable the network to learn complex patterns.


Common Activation Functions

| Activation Function | Formula | Range | Properties | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $(0, 1)$ | Smooth, differentiable | Useful for probabilistic output | Vanishing gradient problem; not zero-centered |
| Tanh | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $(-1, 1)$ | Smooth, zero-centered | Zero-centered output; better gradient flow | Still suffers from vanishing gradients |
| ReLU | $\mathrm{ReLU}(x) = \max(0, x)$ | $[0, \infty)$ | Sparse activation | Efficient; mitigates vanishing gradients | "Dead neurons" if weights drive inputs negative |
| Leaky ReLU | $\mathrm{LeakyReLU}(x) = \max(0.01x, x)$ | $(-\infty, \infty)$ | Allows a small gradient for negative inputs | Avoids the dead-neuron problem | Small fixed negative slope can slow learning for negative inputs |
| PReLU | $\mathrm{PReLU}(x) = \max(\alpha x, x)$ | $(-\infty, \infty)$ | Learnable negative slope | Adaptive negative slope | Risk of overfitting due to extra parameters |
| Softmax | $\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$ | $(0, 1)$ | Converts logits to probabilities | Used for classification tasks | Not for hidden layers; computationally expensive |
| Swish | $\mathrm{Swish}(x) = x \cdot \sigma(x)$ | $\approx [-0.28, \infty)$ | Smooth, differentiable | Improves training; no dead neurons | Computationally expensive |
| GELU | $\mathrm{GELU}(x) = x \cdot \Phi(x)$ (combines ReLU and Sigmoid concepts) | $\approx [-0.17, \infty)$ | Smooth, differentiable | Better for Transformer models | Slower than ReLU |
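
NumPy versions of most of these, as a reference sketch (the GELU uses the common tanh approximation rather than the exact normal-CDF form):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / np.sum(z)

def swish(x):
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation of x * Phi(x), as used in many Transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu, softmax, swish, gelu):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```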

Comparison of Activation Functions

1. Vanishing Gradient Problem

  • Sigmoid and Tanh saturate for large-magnitude inputs, producing near-zero gradients that hamper training in deep networks.
  • ReLU and its variants alleviate this issue by allowing gradients to flow for positive inputs.
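
A quick numerical illustration of this point (derivatives of each activation at a few inputs):

```python
import numpy as np

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

sig = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sig * (1.0 - sig)       # at most 0.25, nearly 0 for |x| large
d_tanh = 1.0 - np.tanh(x) ** 2      # at most 1.0, nearly 0 for |x| large
d_relu = (x > 0).astype(float)      # exactly 1 for every positive input

print("sigmoid'(x):", np.round(d_sigmoid, 4))
print("tanh'(x):   ", np.round(d_tanh, 4))
print("ReLU'(x):   ", d_relu)
```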

2. Computational Efficiency

  • ReLU is computationally simple (max(0,x)), making it efficient for large networks.
  • Swish and GELU are more computationally intensive.

3. Sparse Activation

  • ReLU and its variants deactivate neurons for negative inputs, improving computational efficiency.

4. Zero-Centered Outputs

  • Tanh is zero-centered, helping optimization by balancing gradients.
  • Sigmoid is not zero-centered, potentially slowing convergence.

5. Handling Negative Inputs

  • Leaky ReLU and PReLU handle negative inputs, avoiding dead neurons.
  • Sigmoid outputs only non-negative values, which may not be ideal in some cases.

6. Probabilistic Outputs

  • Softmax is used in classification tasks to produce probabilities for each class, typically in the output layer.

Choosing an Activation Function

Hidden Layers:

  • ReLU is the default choice for simplicity and efficiency.
  • Leaky ReLU or PReLU for avoiding dead neurons.
  • Swish or GELU for modern architectures like Transformers.

Output Layers:

  • Sigmoid for binary classification.
  • Softmax for multi-class classification.

Forward and backward propagation
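
In short: the forward pass computes each layer's activations from input to output and evaluates the loss; the backward pass applies the chain rule from the loss back through the layers to obtain gradients for every weight. A minimal NumPy sketch for a one-hidden-layer network with squared-error loss (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                   # one input example
y = np.array([1.0])                          # target
W1, b1 = 0.1 * rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = 0.1 * rng.standard_normal((1, 3)), np.zeros(1)

# Forward propagation: compute activations layer by layer, then the loss.
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                     # ReLU hidden layer
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))            # sigmoid output
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward propagation: chain rule from the loss back to every weight.
dz2 = (y_hat - y) * y_hat * (1.0 - y_hat)    # dL/dz2
dW2, db2 = np.outer(dz2, a1), dz2
da1 = W2.T @ dz2
dz1 = da1 * (z1 > 0)                         # ReLU gradient mask
dW1, db1 = np.outer(dz1, x), dz1

# One gradient-descent step (learning rate chosen arbitrarily).
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before update:", float(loss))
```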

How to make the network deeper?

Adding More Layers

How: Simply add more hidden layers between the input and output layers. This increases the depth of the neural network.

Why: More layers allow the network to learn more complex, hierarchical representations of data. Each additional layer can capture higher-order features, which makes the model capable of handling more abstract patterns. By increasing depth, the model's capacity to learn complex mappings between inputs and outputs improves, making it suitable for tasks like image recognition, language modeling, and other advanced problems.
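
A sketch of this in PyTorch (assuming PyTorch is available; the layer sizes are arbitrary): going deeper is just inserting more hidden layers between input and output.

```python
import torch.nn as nn

shallow = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# The same kind of model made deeper: extra hidden layers between input and output.
deeper = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
```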

Residual Connections (Skip Connections)

How: In architectures like ResNet, residual connections are used where the input to a layer is added directly to its output, bypassing the transformation at that layer. This is often referred to as "skip connections."

Why: Residual connections address the vanishing gradient problem by allowing gradients to flow more easily during backpropagation, even in very deep networks. This makes it easier to train deep networks without worrying about gradients becoming too small to update the weights effectively. These connections also help the network maintain performance by allowing it to learn both the residual (new information) and the identity (previous knowledge) mapping.
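
A simplified PyTorch sketch of such a block (real ResNet blocks also handle channel and stride changes with a projection on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x), with an identity skip path."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # skip path: carry the input forward
        out = self.relu(self.bn1(self.conv1(x)))   # residual path F(x)
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)           # add the input back in

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)                  # torch.Size([1, 64, 32, 32])
```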

Stacking Blocks of Layers

How: Instead of adding individual layers, a network can be built by stacking blocks of layers. For example, a convolutional block might consist of several convolutional layers followed by pooling layers. These blocks are repeated multiple times to form a deeper network.

Why: Stacking blocks of layers allows for more efficient learning. Each block can perform specific types of feature extraction (like edge detection in convolutional layers), and by stacking them, the network can learn progressively more abstract features. For example, early blocks may detect edges in images, while later blocks can recognize more complex shapes or objects. This modular approach also improves the reusability of network components, which makes training deeper networks more manageable.
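
A VGG-style sketch in PyTorch (channel counts and depths are illustrative):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One reusable block: two conv+ReLU layers followed by 2x2 downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

# A deeper backbone is the same block repeated with growing channel counts:
# early blocks tend to capture edges/textures, later blocks more abstract shapes.
backbone = nn.Sequential(
    conv_block(3, 64),
    conv_block(64, 128),
    conv_block(128, 256),
)
print(backbone)
```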

[[CV-Attention]]

[[CV-Object Detection]]

[[CV-Generate Model]]